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Folding uncertainty in theoretical models into Bayesian parameter estimation is necessary in 
order to make reliable inferences. A general means of achieving this is by marginalising over model 
uncertainty using a prior distribution constructed using Gaussian process regression (GPR). As an 
example, we apply this technique to the measurement of chirp mass using (simulated) gravitational- 
wave signals from binary black holes that could be observed using advanced-era gravitational-wave 
detectors. Unless properly accounted for, uncertainty in the gravitational-wave templates could 
be the dominant source of error in studies of these systems. We explain our approach in detail 
and provide proofs of various features of the method, including the limiting behaviour for high 
signal-to-noise, where systematic model uncertainties dominate over noise errors. We find that 
the marginalised likelihood constructed via GPR offers a significant improvement in parameter 
estimation over the standard, uncorrected likelihood both in our simple one-dimensional study, and 
theoretically in general. We also examine the dependence of the method on the size of training set 
used in the GPR; on the form of covariance function adopted for the GPR, and on changes to the 
detector noise power spectral density. 

PACS numbers: 04.30.-w, 04.30.Tv, 04.70.-s, 04.80.Nn 


I. INTRODUCTION 

The era of advanced ground-based interferometric 
gravitational-wave (GW) detectors is here. The Ad¬ 
vanced LIGO detectors mm in the USA started ob¬ 
serving in September 2015, whilst the Advanced Virgo 
detector 13 m in Europe is expected to come online 
shortly afterwards [S]. The principal target sources of 
GWs for these detectors are the coalescences of pairs of 
compact objects, either neutron stars or black holes. For 
all sources there is great uncertainty in the quoted event 
rate estimates, at least an order of magnitude in either 
direction [6], but regardless of the astrophysical uncer¬ 
tainty, the prospect of a first detection is imminent.^ 

The detection of GW signals is most efficient when we 
have accurate waveform models that can be matched to 
any signals in the (noisy) detector data.^ For parameter 
estimation (PE), it is even more important that the tem¬ 
plate faithfully matches the true signal, as otherwise we 
could infer biased parameter values. Matching a template 
to GW data requires that the model waveform remains 
accurate over the entire duration of the signal; typically 
of the order of hundreds of seconds for neutron-star bi- 
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^ It is possible to detect GWs without templates by looking for co¬ 
herent excess power in the detectors, e.g., [HHIo]. This is effective 
for short-duration signals corresponding to high-mass binaries. 


naries and tens of seconds for black-hole binaries in the 
advanced-detector era (with frequency sensitivity down 
to 10 Hz). Although higher mass sources have shorter 
waveforms (in the detector band), these present more of 
a challenge for modelling as they have detectable merger 
and ringdown components. In contrast, binary neutron 
stars only have the (easier to model) inspiral part of the 
waveform in band. In this paper, we are concerned with 
problems that arise from inaccurate models; therefore, 
we focus our attention on black-hole binaries where the 
issue of waveform uncertainty is most acute. However, 
the techniques we develop could equally be applied to 
neutron-star binaries or any other uncertain signal. In¬ 
accurate waveform models are known to cause signifi¬ 
cant systematic errors when recovering source parame¬ 
ters from observations with both ground-based m and 
space-based GW detectors [T^ . 

There are two problems that arise when using inac¬ 
curate signal models: the detection problem, and the 
PE problem. The detection problem is that the inac¬ 
curate model does not perfectly match to the physical 
waveform, leading to a loss of signal-to-noise (SNR) ra¬ 
tio and, hence, a lower chance of detection (for the same 
false alarm probability). The PE problem, which is the 
focus of this paper, is that the model waveform which has 
the best overlap with the physical signal in the data gen¬ 
erally has parameter values offset from the true source 
parameters, leading to a systematic error in any param¬ 
eter estimates. 

Recently, some of the authors proposed a novel method 
of improving the detection and PE prospects of com¬ 
plicated physical phenomena in noisy data |13) . The 
method applies generally to any situation where accurate 
models of the signal are available, but computational con- 




2 


straints mean that routine detection and PE tasks must 
be carried out with cheaper, less accurate, models. 

In the case of binary black-holes, there are many differ¬ 
ent physical effects that should be included in waveform 
models, such as the merger and ringdown phases follow¬ 
ing the inspiral, the presence of generic spins and pre¬ 
cession, eccentricity and higher-order modes. All these 
phenomena can, in principle, be simulated, thanks to 
recent rapid progress in numerical relativity (NR) |14P 
ng. However, NR simulations are extremely expensive, 
only a few hundred have been performed to date (see 
[muHi and references therein), and these typically con¬ 
sist of only the final few tens of orbits (although, see 
m)- Detection and PE are therefore currently per¬ 
formed using less expensive waveform approximants. The 
existence of NR waveforms has permitted the calibration 
of analytic inspiral-merger-ringdown approximants such 
as the effective-one-body-NR (EOBNR) [201423] or IMR- 
Phenom [24] families, with recent efforts concentrating on 
including the effects of precession in these For 

some recent PE work with inspiral merger and ringdown 
waveform models see [29H3T]. Historically, PE has used 
models based on the post-Newtonian formalism [32], such 
as the TaylorF2 and TaylorT2 waveforms |33|. Despite 
these lacking some of the relevant features, they are suf¬ 
ficiently quick to calculate that they can be simply used 
in PE algorithms. 

To include uncertainty in waveform templates whilst 
minimising computational expense, we use Gaussian pro¬ 
cess regression (GPR) to estimate the effects of waveform 
errors. The method involves constructing a training set of 
the waveform differences between an expensive, accurate 
waveform and a cheaper, less accurate waveform. For the 
accurate waveforms it might be necessary to use some 
combination of NR and the NR calibrated approximants 
discussed above, depending on the numbers and length 
of the available NR simulations. The waveform differ¬ 
ence is evaluated at a relatively small number of points 
in parameter space and stored for later use. GPR is then 
used to interpolate the difference across parameter space 
to give a best estimate and a corresponding uncertainty 
at a general point in parameter space. This interpolation 
provides a prior probability distribution on the waveform 
difference which is then used in marginalising the likeli¬ 
hood over waveform uncertainty. The result is an expres¬ 
sion for the likelihood in terms of the cheaper waveform 
model, but with corrections coming from the training 
set. This marginalised likelihood is negligibly more com¬ 
plicated or computationally expensive to evaluate than 
the standard expression, but provides a better estimate 
of the true likelihood surface (and hence the posterior), 
factoring in our imperfect knowledge of the waveform. 
Therefore, we have not only built a relatively inexpen- 


® See |28| for a study of systematic error (or lack thereof) from 
using EOBNR waveforms with NR injections. 


sive waveform approximant that can include additional 
physics, but we have also accounted for (marginalised 
over) the uncertainty in our new approximant. 

If the standard likelihood with an approximate wave¬ 
form is used for PE then, in general, biased parameter 
estimates are obtained.'^ It has recently been shown by 
some of the authors [36] that, under certain conditions, 
this bias is completely removed by the marginalised likeli¬ 
hood, and, more generally, that the bias is always reduced 
by the marginalised likelihood. 

The technique of GPR assumes that the data in the 
training set have been drawn from a Gaussian process 
(GP) on the parameter space with a mean and a covari¬ 
ance function either specified a priori or estimated from 
the training set itself. The interpolation is then achieved 
by calculating the conditional probability for the GP at 
some new parameter point given the known training set 
values, the mean and the covariance. GPR provides a 
convenient non-parametric way to interpolate the wave¬ 
form differences, and has the additional advantage that, 
by construction, it provides a Gaussian probability dis¬ 
tribution for the unknown waveform difference which can 
be analytically marginalised over. This is important be¬ 
cause it means no extra nuisance parameters are added to 
the PE task which would slow down an already expensive 
process. 

The outline of this paper is as follows. In Sec. [IT] the 
concept of the marginalised likelihood is introduced and 
the use of GPR in its construction is described in de¬ 
tail; we limit ourselves to interpolation across parameter 
space and not across frequency. The main choice made 
in implementing GPR is the specification of the covari¬ 
ance function; Sec. in discusses how the properties of 
the covariance function affect the properties of the cor¬ 
responding GP, and the effects of different choices of co- 
variance function are examined in a toy one-dimensional 
GPR problem. The marginalised likelihood possesses 
several properties which make it appealing for GW as¬ 
tronomy; Sec. jlVj presents proofs of these and discussions 
of their significance. In Sec. [V] the implementation of 
the marginalised likelihood is described for an illustrative 
one-dimensional example; here, properties of the inter¬ 
polated waveforms are examined and PE results for the 
marginalised likelihood are also presented. Additional 
material on the effect of changing the detector noise prop¬ 
erties on the interpolated waveforms are considered in 
App. [^ Finally, concluding remarks and a discussion of 
future directions for implementing the marginalised like- 


^ This is commonly assessed in the GW literature using 
probability-probability (P-P) plots 13411351 : for a catalogue of 
events, these plot the cumulative fraction of events where the 
true parameter is found within the credible interval correspond¬ 
ing to a given probability. If the posteriors are well calibrated, 
then a proportion P should fall in the P credible interval, and 
the plot is a diagonal line. Introducing bias means that the line 
sags below the optimal diagonal. 
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lihood are presented in Sec. VI 


II. THE METHOD 


In this section we detail how we incorporate waveform 
uncertainties into GW data analysis. The material pre¬ 
sented is an expansion of that in [T3]. In Sec. IIA we 
introduce the standard likelihood function and show how 
model uncertainties can be treated like nuisance parame¬ 
ters that can be integrated out (marginalised over). Per¬ 
forming this integration requires that a prior probability 
distribution is specified for the model uncertainties, this 
is constructed using GPR. This is introduced in Sec. |IIB| 
where we briefly summarise some key results pertaining 
to GPR; further details can be found in standard text¬ 
books (e.g., [37H39]). The result of the integration is 
the marginalised likelihood presented in Eq. (28) which 


accurately encodes our state of knowledge of the signal 
parameters, given our imperfect waveform models and 
the noisy data. 


A. The marginalised likelihood 


We consider the scenario where we can construct two 
different waveform models, one accurate but computa¬ 
tionally expensive, the other less accurate but quick to 
calculate. We use the parameters vector A to fully char¬ 
acterise the GW signal; Latin indices from the beginning 
of the alphabet (a, b, ...) will be used to label the dif¬ 
ferent components of this vector, and repeated indices 
should be summed over. The accurate waveforms will 
be referred to as the exact waveform h{X), although the 
method does not require that the accurate waveforms are 


perfect (see Sec. IIIG). The cheaper approximate wave¬ 


form H{X) is related to h{X) by the waveform difference 


H{X) = h{X) + ShiX). 


( 1 ) 


The waveform templates may be calculated in either the 
time domain h{t; X) or the frequency domain h{f; A); the 
dependence of the waveform on time or frequency is sup¬ 
pressed in our notation for brevity. 

In the context of modelling binary black-hole coales¬ 
cences there are several highly accurate waveform ap- 
proximants available, for example, NR waveforms m or 
spin EOBNR (SEOBNR) models [221 El HD]. There are 
also multiple possibilities for the approximate waveform 
family, for example, the Taylor family of approximants 
[33| . Eor the proof-of-principal numerical calculations in 
this paper, we need to be able to perform mock PE runs 
with both waveform families so that we can assess our 
marginalisation technique does indeed offer a significant 
improvement. Therefore, we will pick both approximants 
to be quick to compute, rather than selecting on accu¬ 
racy: our choice of waveform family is discussed in more 
detail in Sec. IV Al 


In a PE study, we wish to construct the posterior prob¬ 
ability distribution for the signal parameters given the 
observed data (and any prior information we have about 
the source) p(A|s). Erom Bayes’ theorem, the posterior 
is given by 


p{X\s) 


L'{s\X)tt{X) 

Z'(s) 


( 2 ) 


where (keeping the notation of [13]) L'(s|A) is the like¬ 
lihood, 7r(A) is the prior distribution on the parameters 
and 2'{s) is the normalising evidence 




L'(s|A)7r(A)dA. 


(3) 


In a Bayesian analysis the evidence Z'{s) can be used as 
the detection statistic (by comparing it with the evidence 
for the null hypothesis to form the Bayes’ factor) [41], and 
the positions and widths of peaks in the posterior p(A|s) 
are used to give the parameter estimates and associated 
uncertainties [33] • For simplicity (although it is not nec¬ 
essary to do so), we assume throughout that 7r(A) is flat 
within the relevant region of parameter space. The single 
remaining challenge is to calculate the likelihood L'(s|A). 

Eor a detector with stationary, Gaussian noise with 
power spectral density 5'„(/) [33], the likelihood is given 

by [33] 


L'(s|A) cx exp ( — 2 ( s — h{X) s — h{X) 


(4) 


Here the noise-weighted inner product has been defined 

as [33] 


{x\y) 


m 

43 ? 



Hf)yifr ] 
Suif) J 



x{fK)y{fK)* \ 

Snih) j ’ 


(5) 


where k labels the M frequency bins with resolution 5f. 
We define the norm of a waveform as 


11x11 = V^(a^, (6) 


for a signal this is equivalent to its SNR. 

In practice it can be unfeasible to sample from the like¬ 
lihood distribution in Eq. 0 because it is prohibitively 
expensive to calculate the exact waveforms h{X)\ instead, 
we must reply on the approximate waveforms to calculate 
an approximate likelihood. 


L(s|A) oc exp 



s-H{X) 



Eor a good approximant 

L(s|A)«L'(s|A); 


(7) 


( 8 ) 
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the natural way to improve this agreement is to con¬ 
struct (inevitably more expensive) approximants that 
have smaller waveform differences Sh{X). Instead, the 


proposal of this paper is to replace L(s|A) with a new 
likelihood which accounts for the uncertainty in the wave¬ 
forms. The alternative likelihood is 


£(s|A) oc 


Sh{X) P[Sh{X)] exp 



s - H{X) + 6h{X) 



(9) 


This new likelihood has marginalised over the uncertainty 
in the waveform difference using the (as yet unspecified) 
prior on the waveform difference P[6h(X)]. 

The prior on the waveform difference should include 
the information available from the limited number of 
available accurate waveforms and could also encode our 
prior expectations about the signal, for example, that 
the approximate waveforms are most accurate at early 
times (or equivalently at low frequencies) when the or¬ 
biting bodies are well separated [55], but gradually be¬ 
come inaccurate as the bodies inspiral. At most points 
in parameter space, an accurate waveform is not avail¬ 
able, and so it is necessary to interpolate the waveform 
difference across parameter space while simultaneously 
accounting for the error this introduces. It would seem 
that the problem rapidly becomes complicated, and even 
if a suitable prior could be constructed the computational 
time needed to evaluate £(s|A) would make it impractical 
in most contexts. 

Fortunately, the technique of GPR provides a natu¬ 
ral way to interpolate the waveform differences across 
parameter space, incorporating all necessary prior infor¬ 
mation. GPR also has the additional property that it 
naturally returns an expression for P[6h(X)] which is a 
Gaussian in Sh{X). Since the exponential factor in Eq. ^ 
is also Gaussian in Sh{X), the functional integral can be 
evaluated analytically. This gives an analytic expression 
for /l(s|A) which can be evaluated in approximately the 
same computational time as L(s|A). 

Henceforth, for brevity, the s dependence will be sup¬ 
pressed in all likelihoods, i.e. L'{X) = L'(s|A), L(A) = 
L(s|A), find C{X) = C{s\X). 


B. Gaussian process regression 


Assume that we have access to accurate waveforms at 
a few values of the parameters {h{Xi) | f = 1, 2,..., N} 
and can cheaply compute approximate waveforms at the 
same parameter values. Our training set is the set of 
waveform differences 


V = {(x,, dh{X,)^ |i = l, 2 ,..., at} , (10) 


where necessary the Latin indices i, j, ... will be used 
to label the different components of the training set (re¬ 
peated indices are not summed over unless specified). It 
is now necessary to interpolate the training set to ob¬ 
tain the prior on the waveform difference first defined in 
Eq. 


P[6h] = P{Sh{X)\V,I), (11) 


where I is any other prior information we possess about 
the waveforms. The simplest and most natural choice for 
such a prior is to assume that the waveform difference is 
a realisation of a GP (a Gaussian is the the maximum- 
entropy distribution given that we know a characteristic 
range of variation [46]). 

Sh{X)^gP{miX),k{X,X')) (12) 

A GP can loosely be thought of as the generalisation of 
a Gaussian distribution to an infinite number of degrees 
of freedom. It is completely specified by the mean m{X) 
and covariance fc(A,A') functions in the same way as a 
Gaussian distribution is fully specified by a mean and 
variance. More formally, a GP is an infinite collection 
of variables, any finite subset of which are distributed as 
a multivariate Gaussian. For a set of parameter points 
{Ai}, including, but not limited to, the training set V, 


6hiX,) 


~ Af{m, K ), 


(13) 


where the mean vector and covariance matrix of this 
Gaussian distribution are fixed by the corresponding 
functions of the GP, 

[m\ = m{Xi) , [K]^. = k{\, A^), (14) 

with probability density function (here correcting the 
normalising prefactor written in m which mistakenly 
included a square root) 













5 


({<5/i(A0}) 


1 


(27r)^ \K 




(15) 




r 


The round brackets denote a new inner product with re¬ 
spect to some noise weighting S'„{f), which we leave un¬ 
specified for the moment; 

5(/)y(/)‘ 


My) = d/: 

M 


= 43? <^^<5/ 


^ K—l 


S'M.) 


(16) 


In writing down Eq. (15) and stipulating that the 


covariance function fc(A, A') has no dependence on fre¬ 
quency, we are effectively assuming that a) the parame¬ 
ter space structure of the model errors is frequency in¬ 
dependent; and b) the typical size of errors has a fre¬ 
quency dependence proportional to Under the 

assumption that waveform model errors are uncorrelated 
in frequency, the normalising factor in Eq. should be 
raised to the power M; however, this assumption leads to 
model errors that average to zero over frequency and have 
only a small effect on the likelihood. The optimal means 
of incorporating frequency dependence would be to in¬ 
troduce an additional covariance function in frequency 
as well as the covariance in parameter space. This fre¬ 
quency covariance introduces a correlation length scale 
in frequency which can be learnt from the training set in 
exactly the same manner as we describe below for cor¬ 
relations in A. This correlation length scale reduces the 
number of independent frequencies from M to some new 
effective number Mgff. 

Performing this double GPR interpolation in / and 
A is beyond the scope of the current paper. Instead, 
here we are in effect setting Mes = 1, giving the ex¬ 
pression in Eq. this is analogous to assuming that all 
the frequency bins of the noise-weighted waveform at a 
particular point in parameter space are perfectly corre¬ 
lated. Setting Meff = 1 gives the largest uncertainty of 
any fixed number of independent frequencies and is there¬ 
fore a conservative choice. Despite these simplifications, 
our marginalised likelihood has many desirable proper¬ 
ties (which we discuss and prove in Sec. IV), and per¬ 
forms well in the numerical example presented in Sec. [V| 
We will return to the more general problem of performing 
the extended GPR including frequency in the future. 

Specifying how we compute the mean and variance for 
the GP determines how the waveforms are interpolated 
and fixes our prior for waveform uncertainty across pa¬ 
rameter space. Our GP has a zero mean as we have cho¬ 
sen to interpolate the waveform difference rather than 
the waveform directly. By first subtracting off an ap¬ 
proximate model we leave a quantity which is uncertain. 


but has no known bias. If we had some additional prior 
knowledge that the approximate waveform was system¬ 
atically wrong across parameter space, then this should 
be added into the approximate model so that the zero- 
mean assumption becomes valid. Identical results for the 
marginalised likelihood could also be obtained by directly 
interpolating the accurate waveforms using a GP with a 
mean equal to the approximate waveforms; however, we 
choose to interpolate waveform differences because zero- 
mean GPs are simpler to handle numerically. 

Specifying the covariance function is central to GPR 
as it encodes our prior expectations about the properties 
of the function being interpolated. Possibly the simplest 
and most widely used choice for the covariance function 
is the squared exponential (SE) [3S] 


k{Xi, Xj) = CTj exp 


-x5a6(V-A,)“(V-A,)‘ 


(17) 


which defines a stationary, smooth GP. In Eq. 0, 
a scale ct/ and a (constant) metric gab for defining a 
modulus in parameter space have been defined. These 
are called hyperparameters and we denote them as 9 = 
{o'/jffab}, with Greek indices /i, ^, ... to label the com¬ 
ponents of this vector. If the available accurate wave¬ 
forms contain some uncertainty then this can also be in¬ 
cluded by adding a diagonal matrix C to Eq. <0 , where 
the element Cu (no summation) is the uncertainty in 
the accurate simulation at A^; this is discussed further in 
Sec. lmCl 


The probability in Eq. (15) is referred to as the hy¬ 
perlikelihood, or alternatively the evidence (as in [13) 1 for 
the training set; it is the probability that that particular 
realisation of waveform differences was obtained from a 
GP with a zero mean and specified covariance function. 
The hyperlikelihood depends only on the hyperparame¬ 
ters and the quantities in the training set, so we denote 
it as Z{9\'D). The log hyperlikelihood is 


\nZ{e\V) = 


N 

ln( 27 r) 


^^inv k(Xi,Xj) (^6h{Xi) 6h{Xj) 




— In 
2 


det 




(18) 


For all subsequent calculations the values of the hyper¬ 
parameters are fixed to their optimum values dop, defined 






















6 


as those which maximise the hyperlikelihood: 


dZ{9\V) 




s = eop 


= 0 . 


(19) 


Maximising the hyperlikelihood with respect to 9 is one 
of many approaches which could be taken. For example, 
a better motivated approach would be to consider the 
hyperparameters as nuisance parameters in addition to 
the source parameters A, and marginalise over them while 
sampling an expanded likelihood, 


Having fixed the properties of the covariance function 
by examining the training set, we can now move on to 
using the GP as a predictive tool. The defining property 
of the GP is that any finite collection of variables drawn 
from it is distributed as a multivariate Gaussian in the 
manner of Eq. (15). Therefore, the set of variables formed 
by the training set plus the waveform difference at one 
extra parameter point 6h(X) is distributed as 


'Sh{K) 

^6h{X) 


^AA(0,S) , 




( 21 ) 


-^expanded iX,9\V) (xC{X\9,V)Z{9\V). (20) 

The disadvantage of this approach is that the hyperlike¬ 
lihood is much more expense to compute than the stan¬ 
dard approximate likelihood and the inclusion of extra 
nuisance parameters also slows down any PE. In con¬ 
trast, our proposed method of maximising the likelihood 
is a convenient heuristic which is widely used in other 
contexts [T7H4^ and allows all the additional computa¬ 
tion to be done offline. It would be useful, in future work, 
to check explicitly that the different ways of dealing with 
the hyperparameters give consistent results in the con¬ 
text of GW source modelling. 


where K is defined in Eq. 
scalar iF,* are defined as 


(14) and the vector and 


[K,], = fc(A„A), K,, = k{X,X). (22) 


On the right-hand side of Eq. (211 all the quantities are 
known because the hyperparameters have been fixed to 
their optimum values, and on the left hand side all the 
quantities are known (from the training set) except for 


6h{X). Therefore, the conditional probability of the un¬ 
known waveform difference given the known differences 
in V can be found. This conditional probability is given 
by (e.g., [371138]) 


P[ShiX)] = 


2^^^(A)nf=i^U/« 


■ exp - 


(^Sh{X) - n{X) 6h(X) - fi{X) 


2ct2(a) 


(23) 


where the GPR mean and its associated error have been 
defined as 

t,{X) = J2[K4^[K-^]^^ShiX,), (24) 

a\X) = K** - ^ [K.], [K.]^ . (25) 

i,3 


Furnished with the expression for P[5/i(A)], the 
marginalised likelihood in Eq. (|^ can now be evaluated. 
The integrand in Eq. is the product of two Gaussians 
and can be calculated analytically. 


C{X) 


l + ^^(A)nf=l mh)/Snih)) 


exp 


1 

'2 L 


s — 77(A) -l- /^(A) s — 77(A) -I- /i(A) 


The square brackets denote a third inner product with re¬ 
spect to the new noise weighting S”{f), where S”{f, A) = 


Sn{f) + a^X)S'M), 


[x\y] 


43? 

43? 



S'M) I 



f ■ 


(26) 


(27) 
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For the remainder of this paper, for simplicity, we take 
^nif) = ^n{f) so the three signal inner products we have 
defined become (-I-) = (-I-) = [•|•]/(l + tT^(A)) [IS]- With 
this simplifying assumption, the marginalised likelihood 
becomes 


£(A) 


1 + ct2(A) 


X exp 


s — 77(A) + /r(A) 


l + u2(A) 


• (28) 


Gaussian form, we can marginalise analytically to pro¬ 
duce the new likelihood Eq. (28). The properties of this 
marginalised likelihood are explored extensively through¬ 
out the remainder of this paper. 

In Secs. m and 1^ we discuss theoretical properties 
of the GPR and marginalised likelihood respectively. A 
reader who is primarily interested in the PE results ob¬ 
tained with the likelihood in Eq. (28) may skip to Sec.[v| 


III. THE COVARIANCE FUNCTION 


As mentioned earlier Eq. (15) issues that the waveform 
model errors are uncorrelated in frequency. The assump¬ 
tion that S'^{f) = Sn{f) additionally assumes that the 
typical size of the waveform error at a frequency / is given 
by \/ Snif)- This choice can be motivated to a certain ex¬ 
tent by examining the hyperlikelihood in Eq. (18) which 
is used to train the GP. This hyperlikelihood contains the 
overlap matrix ((5/i(Ai)|(5/i(Aj)). Choosing S'^{f) = Sn{f) 
acts to downweight the correlations at frequencies we are 
insensitive to (ignoring errors we cannot measure) and 
hence the resulting hyperparameters give an interpolant 
which is tuned to better represent the waveform correla¬ 
tions at the frequencies to which we are most sensitive: 
we weight waveform errors based upon their impact on 
the likelihood. The assumption of frequency-independent 
models errors gives a value for the GPR uncertainty 
in Eq. (28) that is also frequency-independent. This can 
be shown to be a conservative choice in the sense that it 
gives broader and less informative posteriors. 

In a follow-on study we will provide a proof of the con¬ 
servative nature of this assumption and examine a num¬ 
ber of different choices for the weighting function 5'(j(/), 
but we use the simplifying assumption £'(,(/) = S'„(/) 
throughout the current paper. Despite these simplify¬ 
ing assumptions, we find that the resulting likelihood in 
Eq. (j2^ performs well. In App. [b| we examine the sensi¬ 
tivity of the method to small changes in the noise curve 
Sn{f) which will occur in real experiments. 

In Eq. (28) the best fit waveform has shifted by an 
amount /r(A); this is the best estimate of the waveform 
difference returned by the GPR. The quantity H{X) -I- 
/r(A) can be regarded as a new waveform approximant 
built from the accurate and approximate waveforms with 
the aid of GPR. However, a bonus of this way of including 
the training set directly into the likelihood is that the 
extra uncertainty associated with using the GPR as an 
interpolant is automatically included via the broadening 
of the posterior caused by cr^(A) > 0. 

In this section we have explained how uncertainty in 
waveform models can be included in PE through use of a 
marginalised likelihood. We defined such a likelihood in 
Eq. ([^, but the marginalisation requires a prior prob¬ 
ability on the waveform uncertainty across parameter 
space. We construct this from a training set using GPR; 
the resulting prior is given in Eq. (23). Since this is of 


In the previous section we described how waveform un¬ 
certainties could be marginalised out using a prior con¬ 
structed by using GPR on a training set. The only aspect 
of this that is not prescribed by the training data is the 
choice of the covariance function. This plays an impor¬ 
tant role in determining the properties of a GP. In this 
section, we discuss the properties of different choices of 
the covariance function in GPR. The properties of the 
covariance functions discussed in this section are known 
in the GPR literature, but are included here as they are 
not a widely appreciated in the GW community. The 
material presented in this section on the covariance func¬ 
tion will be used in the interpretation of our results in 
Sec. US and Sec. 0 

The only necessary requirements we have of a covari¬ 
ance function are that it is a positive definite; i.e. for 
any choice of points {Ai} the covariance matrix K^j = 
k{Xi,Xj) is positive definite. 

Throughout this paper, GPs are assumed to have zero 
mean, and therefore be fully specified by the covariance 
function fc(Ai,A 2 ). However, the proofs regarding conti¬ 
nuity and differentiability of GPs discussed in this sec¬ 
tion, and proved in App. [A} are done without recourse to 
the zero-mean assumption. The covariance encodes all 
information available about the properties of the func¬ 
tion being interpolated by the GPR. It is central to the 
GPR and hence also to the marginalised likelihood. 

The covariance function (and the corresponding GP) is 
said to be stationary if the covariance is a function only 
of r = Ai — A 2 , furthermore it is said to be isotropic if 
it is a function only of r = |r| = |Ai — A 2 I.® Isotropy 
of a GP implies stationarity. All of the GPs used for 
numerical calculations in this paper are isotropic (and 
hence stationary) fc(Ai,A 2 ) = fc(r) = ^(t), although the 
generalisation to non-stationary GPs is briefly discussed 
in Sec. IHIBl 

An example of how the properties of the covariance 
function relate to the properties of the GP, and hence 
the properties of the resulting interpolant, is given by 


® We have yet to define a metric on parameter space with which 
to take the norm of this vector (see Sec. |III A| , but all that is 
required here is that a suitably smooth metric exists. 





















considering the mean-square (MS) continuity and differ¬ 
entiability of GPs. It can be shown that the first MS 
derivatives of a GP are MS continuous (the GP is said 
to be rid-times MS differentiable) if and only if the first 
2nd derivatives of the covariance function are continu¬ 
ous at the diagonal point Ai = A 2 = A*. For a stationary 
GP this condition reduces to checking the 2nd derivatives 
of k{T) at r = 0, and for an isotropic GP checking the 
2nd derivatives of A:(r) at r = 0. A proof of this result, 
following [39], is given in App. It is the smoothness 
properties of the covariance function at the origin that 
determine the differentiability of the GP. This result is 
used in Sec. |IIIB| when discussing different functional 
forms of covariance for use in GPR. 

In this section, the effect of the choice of covariance 
function on the GPR are explored. We consider three as¬ 
pects that enter the definition of the covariance function: 

(A) specifying the distance metric in parameter space 

9abl 

(B) specifying the functional form of the covariance 
with distance fc(r), 

(G) and whether or not to include errors cr„ on the 
training set points. 

Stages A and B cannot be completely separated; there 
exists an arbitrary scaling, a of the distance r —>■ ar 
which can be absorbed into the definition of the covari¬ 
ance, ^(t) — k{T/a). However, provided the steps are 
tackled in order, there is no ambiguity. 


A. The metric gab 


plying this procedure to a D-dimensional SE function 
generates a non-stationary analogue |50| 


fc(A,A,) = 


X exp ( --Qzj I , 


g^ -b g^ 
2 


- 1/2 


(29) 


where 


Q., = (A.-A,)“(A,-A,) 


b I Sib + Sib 


-1 


(30) 


and gif, = inv[ 5 ab(Ai)] is the inverse of the parameter- 
space metric at position A^. Provided that the metric 
gabW is smoothly parameterised this non-stationary SE 
function retains the smoothness properties discussed ear¬ 
lier. 

For the interpolation of waveform differences, it is easy 
to imagine the potential benefits of using non-stationary 
GPs. For example, in the case of the spin parameter, it 
could be imagined that the waveform difference consid¬ 
ered as a function of the effective spin of the compact 
objects Sh{x) would vary on long length scales in y for 
small values of the spin, but on much shorter scales for 
larger values of the spin. 


The generalisation in Eq. (291 involves the inclusion of 


a large set of additional hyperparameters to characterise 
how the metric changes over parameter space; for exam¬ 
ple one possible parameterisation would be the Taylor 


series 


The first stage involves defining a distance r between 
two points in parameter space. One simple way of doing 
this, and the way used in the SE covariance function 
in Eq. ([^, is to define = gabi^i - - A 2 )^, 

where gab are constant hyperparameters. This distance 
is obviously invariant under a simultaneous translation 
of Ai —>■ Ai -b A and A 2 —t A 2 -b A; therefore, this defines 
a stationary GP. For a H-dimensional parameter space, 
this involves specifying D{D-\-l)/2 hyperparameters gab- 
More complicated distance metrics (with a larger num¬ 
ber of hyperparameters) are possible if the condition 
of stationarity is relaxed, i.e. gab —t gabW- K was 
demonstrated by |50j how, given a family of station¬ 
ary covariance functions, a non-stationary generalisation 
can be constructed. A stationary covariance function 
can be considered as a kernel function centred at Ai; 
^(Ai, A 2 ) = fc^^(A 2 ). Allowing a different kernel function 
to be defined at each point Ai, a new, non-stationary co- 
variance function is fc(Ai, A 2 ) = / dMfca(Ai)fcs(A 2 ).® Ap- 


® To see that fc is a valid covariance function consider an arbi- 


3ab(A) — gabiXo) + ^A”^ — Aq^ 


dgabW 

d\^ 


A—Ao 


+ ... (31) 


with the hyperparameters 5ah(Ao), dgabW/dX'^, and so 
on. As we see below, the inclusion of even a single extra 
hyperparameter can incur a significant Occam penalty 
m which pushes the training set to favour a simpler 
choice of covariance function. For this reason we only 
consider stationary GPs. However, the generalisation to 
a non-stationary GP (perhaps in only a limited number 
of parameters, e.g., spin) should be investigated further 
in the future. In making this generalisation, one would 
have to be guided significantly by the prior expectations 
of which parameters to include and how to parameterise 
the varying metric. 

An alternative to considering non-stationary metrics is 
instead to try and find new coordinates A = A(A) such 


trary series of points {Aj}, and the sum over training set points 
I = j aiajk(Xi/\j)-, for to be a valid covariance it is both 
necessary and sufficient that / > 0. Using the definition of k gives 
I = f du Ylij = J du{Y.iUikfi{\i))'^ > 0. 
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that the metric in these coordinates becomes (approx¬ 
imately) stationary. There could be hope for this ap¬ 
proach, as a similar problem has been tackled in the con¬ 
text of template placement for GW searches m- Here 
the problem is to find coordinates such that waveform 
templates placed on a regular grid in these coordinates 
have a constant overlap with each other. The waveform 
match can be viewed as defining a metric in parameter 
space, and hence the desired coordinates make this metric 
stationary. For a post-Newtonian inspiral signal, a set of 
chirp-time coordinates were proposed by |52| which make 
the metric nearly stationary. Metrics have also been cal¬ 
culated for inspiral-merger-ringdown models, for exam¬ 
ple IMRPhenomB [53j. While it could be possible to 
adapt the parameter-space metrics already calculated for 
different approximants for use in template placement al¬ 
gorithms to help in constructing our GPR training sets, 
we do not consider this approach further here. 

Throughout the remainder of this paper the metric 
components gab are treated as constant hyperparameters 
fixed to their optimum values, as discussed in Sec. [H] 


B. The functional form of fe(r) 

The second stage of specifying the covariance function 
involves choosing the function of distance A:(t). In gen¬ 
eral whether a particular function ^(t) is positive definite 
(and hence is a valid covariance function) depends on the 
dimensionality D of the underlying space (i.e. A G K^); 
however, all the functions considered in this section are 
positive definite for all D. Several choices for A:(r) are 
particularly common in the literature. These include the 
SE covariance function (which has already been intro¬ 
duced), given by 

ksE (t) = cr^ exp . (32) 

The power-law exponential (PLE) covariance function is 
given by 



where 0 < rj <2. The PLE reduces to the SE in the case 
77 = 2. The Cauchy function is given by 

cr? 

fcCauchy(T) = _|_ ^ 2 / 27 y)^’ ’ 

where 77 > 0. This recovers the SE function in the limit 
77 —>■ 00 . And finally, the Matern covariance function is 
given by |Slj 

fcMat(T) = , (35) 

where 77 > 1/2, and AT,, is the modified Bessel function 
of the second kind |SS]. In the limit 77 —>■ 00 , the Matern 
covariance fnnction also tends to the SE. 


Fig.g shows the fnnctional forms of the covariance 
functions. They have similar shapes: they all return a 
finite covariance at zero distance which decreases mono- 
tonically with distance and tends to zero as the distance 
becomes large. In the case of interpolating waveform dif¬ 
ferences this indicates that the errors in the approximate 
waveform at two nearby points in parameter space are 
closely related, whereas the errors at two well separated 
points are nearly independent. The PLE, Cauchy and 
Matern function can all be viewed as attempts to gen¬ 
eralise the SE with the inclusion of one extra hyperpa¬ 
rameter 77 , to allow for more flexible GP modelling. All 
three alternative functions are able to recover the SE in 
some limiting case, but the Matern is the most flexible 
of the three. This can be seen from the discussion of the 
MS differentiability of GPs given at the beginning of this 
section. 

The SE covariance function is infinitely differentiable 
at r = 0, and so the corresponding GP is infinitely MS 
differentiable. The PLE function is infinitely differen¬ 
tiable at r = 0 for the SE case when 77 = 2, but for 
all other cases it is not at all MS differentiable. In con¬ 
trast, the Cauchy function is infinitely differentiable at 
T = 0 for all choices of the hyperparameter 77 . The 
Matern function, by contrast, has a variable level of dif¬ 
ferentiability at T = 0 , controlled via the hyperparameter 
77 [^. The GP corresponding to the Matern covariance 
function in Eq. (351 is n^-times MS differentiable if and 
only if 77 > 77 ( 7 . This ability to adjust the differentiabil¬ 
ity allows the same covariance function to successfully 
model a wide variety of data. In the process of maximis¬ 
ing the hyperlikelihood for the training set over hyperpa¬ 
rameter 77 , the GP learns the (non)smoothness properties 
favoured by the data, and the GPR returns a correspond¬ 
ingly (non)smooth function. 


C. The inclusion of noise a„ 

Even the most accurate waveform models h{X) still 
contain some error with respect to the unknown true 
physical signal h'{X). This could be because the wave¬ 
form model does not include all of the physics or because 
it is calculated using a method with finite accuracy. We 
can account for the error in our training set points by 
adding a noise variance term j in the covariance 

function, 

k{Xi,Xj) —>■ k{Xi,Xj) -\- , (36) 

which alters the covariance matrix in Eq. (|14[ ) corre¬ 
spondingly, but not the expressions in Eq. \22\ . Here 
iJn,i is the fractional error ||/i—/ 7 '||/|| 5 /i|| in each training 
set point, where the norm is taken with respect to the 
inner product in Eq. ([^ and Sh = H — h. This ensures 
that tTj is still an overall scale for the covariance function. 

We do not maximise the hyperlikelihood over cr^ this 
is because cr„,i is related to \\h — h'\\, which cannot be 
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Power-law exponential Cauchy Matern 



FIG. 1: Plots of the different generalisations of the SE covariance function discussed in Sec. |III B] The left-hand panel shows 
the PLE function, the centre panel shows the Cauchy function, and the right-hand panel shows the Matern function; in all 
cases the value of cr/ was fixed to unity. In each panel the effect of varying the additional hyperparameter is shown by the 
three curves. For the PLE covariance the case 77 = 2 recovers the SE covariance, while for the Cauchy and Matern covariances 
the case 77 —>■ 00 recovers the SE covariance. 


learnt from a training set containing the differences Sh. 
The noise error is instead fixed at some overall error esti¬ 
mate for the accurate model, which is a conservative ap¬ 
proach. We consider the simple case an, i = cr„ in this pa¬ 
per; however, it is not necessary for all training set points 
to have the same error, as a training set might comprise 
different families of waveform models (e.g., a mix of dif¬ 
ferent variants of (S)EOBNR or IMRPhenom waveforms, 
or NR waveforms with different numerical resolutions). 

If the overall noise error is a fan, then the GPR uncer¬ 
tainty at a training set point (T(Ai) satisfies, 

a(A,)<a/a„, V7G{l,2,...,fV}. (37) 


This is because the different points in the training set are 
assumed to come from a correlated GP, and so nearby 
measurements also act to decrease the error. 

There is also a more practical motivation for the in¬ 
clusion of noise. Inversion of the covariance matrix in 
Eq. (141 can pose issues of numerical stability for large 
training sets. In general, as the number of points in the 
training set increases, the determinant of the covariance 
matrix decreases rapidly towards zero, such that the ma¬ 
trix is nearly singular (and hence the matrix is difficult to 
invert). The solution to this is to add a small fixed noise 
= J ^ 1, or jitter, to each training set point as per 
Eq. ( |36[ ). The eigenvalues of the new covariance matrix 
are then (approximately) the eigenvalues of the original 
matrix plus J. This prevents the determinant, the prod¬ 
uct of the eigenvalues, from becoming vanishingly small 
and dramatically improves the stability of the inversion. 
In effect, we are no longer requiring our interpolant to 
pass through every training set point; instead, we only 
ask it to pass close to each point (with the proximity 
determined by the value of J). 


D. Compact support and sparseness 


When evaluating the covariance matrix for the training 
set Kij this leads to a matrix where all entries are posi¬ 
tive; i.e. a dense matrix. When performing the GPR it is 
necessary to maximise the hyperlikelihood for the train¬ 
ing set with respect to the hyperparameters. This process 
involves inverting the dense matrix Kij at each iteration 
of the optimisation algorithm. Although this procedure 
is carried out offline, it still can become prohibitive for 
large training sets. A related problem, as pointed out 
in Sec. |III C| is that for large training sets the determi¬ 
nant of the covariance matrix is typically small which 
also contributes to making the covariance matrix hard to 
invert. 

One potential way around these issues is to consider a 
covariance function with compact support, 

/c(r )>0 tG[ 0 ,T], 

A:(t) =0 V r G (T, 00 ), 


where T is some threshold distance beyond which we as¬ 
sume that the waveform differences become uncorrelated. 
This leads to a sparse, band-diagonal covariance matrix, 
which is much easier to invert. Gare must be taken 
when specifying the covariance function to ensure that 
the function is still positive definite (which is required of 
a GP): if the SE covariance function is truncated, then 
the matrix formed from the new covariance function is 
not guaranteed to be positive definite. 

Nevertheless, it is possible to construct covariance 
functions which have the requisite properties and sat¬ 
isfy the compact support condition in Eq. (391. These 
are typically based on polynomials. We consider a series 
of polynomials proposed by [56], which we will refer to 
as the Wendland polynomials. These have the property 
that they are positive definite in and are 2q-i\me dif¬ 
ferentiable at the origin. Therefore the discrete parame¬ 
ter q is in some sense analogous to the 77 hyperparameter 
of the Matern covariance function in that it controls the 
smoothness of the GP. Defining fj to be 


All of the covariance functions considered up until this 
point have been strictly positive; 


/? 


D 

~2 


+ g + l 


(40) 


(38) and using 0(a:) to denote the Heaviside step function, the 


/c(t) >0 V r G [0,00). 
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Wendland polynomials 



FIG. 2: Plots of the first few Wendland polynomial covariance 
functions. All these functions have compact support, fc(r) = 0 
for r > 1. As the value of q increases the functions become 
smoother near the origin. 


first few Wendland polynomials are given by, 


kD,o{T) 

1 

© 

b 

II 

T){l-Tf , 


(41) 

koAir) 


t)( 1 - r)'^+i 

[1 + (/3 + 1) "t] , 

(42) 

kD,2{T) 

I 

1—1 

© 

II 


[ 3 + (3/3 + 6) r 



+ (/?" + 

4^ + 3) r^] , 


(43) 

kD,3{T) 

(T? 

= ^0(1- 
15 ^ 

.^)(l-^)/3+3 

[15 -f (15/3 -h 45) T 


-f ( 6/32 H 

h 36/3 -h 45) 




+ (/3^ + 

9^2 + 23/3-M5) T^] . 

(44) 


These are plotted in Fig. Other types of covariance 
function with compact support have also been proposed 
and explored in the literature (e.g., [F7U59] b but we do 
not consider them in this paper. 


IV. PROPERTIES OF THE METHOD 


In this section proofs of several useful features of the 
marginalised likelihood are presented. In Sec. |IV A| we 
derive the PE error in a linearised formalism, recovering 
results of m as well as new results for our marginalised 
likelihood; in Sec. |IVB| we use these results to show that 
the marginalised likelihood should not exclude the true 
parameter values even at large SNR, and in Sec. |IV C[ 
we derive other limits of the marginalised likelihood at 
specific points in parameter space. 


A. The error at linear order 

A more detailed understanding of the theoretical error 
problem, and the solution offered by the marginalised 
likelihood can be gained by examining the behaviour of 
the likelihoods in the vicinity of a maximum. 


The exact likelihood, from Eq. Q, is given by 


L'{X) oc exp 



s — h{X) 



(45) 


and has a maximum at the best fit parameters, Abf, which 
satisfy the equation 


dah{Xhi) s - /i(Abf)) = 0. 


(46) 


The measured data consist of noise and the physical sig¬ 
nal with the true parameters, Atr, that is s = n -I- /i(Atr). 
Therefore Eq. (46) becomes 


9a/l(Abf) U + /l(Atr) - /l(Abf)) = 0. (47) 


Expanding the difference in the signals to leading order 
in AA = Abf — Atr gives 


dahiX^) n - AA^abh(Abf)) = 0, (48) 


whence 


AA“ = (S-i)“'’(n|ab/i(Abf)) , (49) 


where S^b = {dah{Xhf)\dbh{Xhf))■ Therefore, at leading 
order, the shift between the best fit and true parameters 
for the exact likelihood consists of one term proportional 
to n; we call this the noise error. The matrix is 
the usual Fisher information matrix (FIM) which char¬ 
acterises the random errors at leading order [50]. 

The approximate likelihood, from Eq. is given by 


L{X) oc exp 



s - H{X) 



(50) 


and has a maximum at the best fit parameters which 
satisfy the equation 


daH{Xu) s-H{Xm))=0. 


(51) 


Using s = n -I- /i(Atr) in Eq. (51) and expanding to lead¬ 
ing order in A A gives 


daH (Abf) 


ShiXtr)-AX^dbHiXu))=0, 


(52) 


thus 


AA“ = (r-i)“'’(n|abi7(Abf)) 

- (r-i)“' (^ShiXtr)\dbHiXM)) , (53) 


where Fab = {daH{Xh{)\dbH{Xhi))■ Therefore, at leading 
order the shift between the best fit and true parameters 
for the approximate likelihood consists of two terms: the 
noise error as before (except with the FIM evaluated with 
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the approximate model) and what we call the model er¬ 
ror, 


A 


modelA“ = - (,5/i(Atr)|a6i/(Abf)) . (54) 


The model error is independent of the noise realisation, 
and hence represents a systematic error in the PE asso¬ 
ciated with using inaccurate models. 

The above treatment of the exact and approximate 
likelihoods is a brief summary of part of the analysis done 
by [ig. We now apply the same type of analysis to the 
new marginalised likelihood to see how this reduces or 
removes the model error. 

The marginalised likelihood is given in Eq. (|^. From 
Eq. (25) it can be seen that the interpolated waveform 


difference ^(A) is a linear combination of Sh{Xi) from the 
training set. We will assume, for this calculation only, 
that the waveform difference is also a small quantity in 


the sense that ||5/i|| <C ||/i|| with the norm from Eq. (§. 
Therefore, fj, = 0{6h) and cr/ = 0{6h). We shall keep 
contributions up to 0{Sh). 

Under the twin assumptions that AA and ||i5/i|| are 
small, the marginalised likelihood is approximately given 
by 


£(A)« exp 



s — H[X) 



(55) 


This has a maximum at the best fit parameters Abf which 
satisfy the equation 


{da (^(Abf) - i^(Abf)) 


s — i/(Abf) -I- /i(Abf)^ 



Using s = n + ft,(Atr), and expanding to leading order in 
AA and Sh gives 



da (^(Abf) — n + /i(Atr) — H{Xm) + n{Xhi) 

(/^(Abf) — ^f(Abf)^ n — 6h{Xtr) — AA^9hiJ(Abf) -I- /^(Abf) 


0 , 

0 . 


(57) 

(58) 


This expression can be rearranged to find AA, dropping 
all terms second order in small quantities, 


AA“ = (r-i)“'’(n|4(i7(Abf)-M(Abf))) 

{sh{Xt,)\dbH{XM)) 

+ (r-i)“VAi(Abf) af,i7(Abf)) . (59) 


Therefore, at leading order, the shift between the best 
fit and true parameters for the marginalised likelihood 
consists of three terms: the noise and model errors from 
before, and a new shift arising from the marginalisation, 

AniargA“ = (T” 1) (/r( Abf )1( Abf )). The expression 
for the model and marginalisation errors are similar and 
appear with opposite signs (as would be hoped since the 
maginalised likelihood was designed to remove the model 
error) so the remaining model error is proportional to 
Sh{Xhf) — /r(Abf) (integrated inside the inner product). 

If the training set is dense in the region of the 
peak, and the hyperparameters have been correctly es¬ 
timated, it is reasonable to assume that the GPR in- 
terpolant of the waveform difference performs well, and 
we have {Sh{X) — Ai(A)|-) « 0. Under these conditions the 
marginalised likelihood removes the systematic model er¬ 
ror from the parameter estimates. In reality the interpo¬ 
lation is not perfect, and the method is limited by the 
available information in the training set, so that a resid¬ 
ual model error proportional to (5/i(Abf) — /i(Abf)|-) re¬ 
mains. 


B. The limit of large SNR 


As first pointed out by [Tg, the systematic error as¬ 
sociated with the inaccurate model used in the approxi¬ 
mate likelihood is independent of the SNR, whereas the 
random error associated with the noise realisation de¬ 
creases with increasing SNR. Therefore, there exists a 
critical SNR for the approximate likelihood above which 
the systematic model error dominates the random noise 
error. If the approximate likelihood is used to infer the 
parameters of a source with an SNR close to or above 
this critical value then the inferred parameters are sig¬ 
nificantly and systematically biased. In this section we 
examine the behaviour of all three likelihood functions 
for large SNR and show that the marginalised likelihood 
does not suffer from this problem even in the limit of infi¬ 
nite SNR. Therefore, parameter estimates obtained using 
the marginalised likelihood can always be trusted. 

In this section in order to ease the process of taking 
the limit of large SNR all waveforms are understood to 
be normalised such that ||Ii(A)|| = I, and the amplitude 
is taken out as a prefactor, so the full signal is Ah{X). In 
addition we will assume for simplicity that the measured 
value of A is equal to the true value for the signal; this 
has no effect our final result. 

The exact likelihood Eq. (|^ is given by 


L'(A) (X exp 



s — Ah{X) 



(60) 
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The measured data is given by s 
exact likelihood is peaked at Abf 
Eq. @) 


n + Ah{Xtr), and the 
Atr + AA, where (see 


AA“ = - (S 


-1 


, ab 


dbh{Xhf) 


(61) 


In this section, the FIM 'Eab is defined in terms of the 
normalised waveforms, i.e. is independent of A] this is 
done so that all of the dependence on A remains explicit. 
The exact likelihood evaluated on the true parameters is 
given by 


L'(Atr) oc exp ||nf^ . (62) 

The exact likelihood evaluated on the best-fit parameters 
is given by 


L'(Abf) oc exp 


n + A [h{Xtr) - /i(Abf) 


■ (63) 


The ratio of these two likelihood values is denoted 
Rcxact = L'(Xti)/L'{Xhf)■ Expanding the difference 
h{Xti:) — h{Xhi) in the above equation to leading order 
in AA gives 


daHXhi)) (n dbh(Xh{) 


lni?exact = -^(E-i)“\n 

. 

The quantity Rexact is the factor by which the likelihood 
of the true parameters is suppressed with respect to the 
peak likelihood. From Eq. ( [m| it can be seen that this 
factor is a random variable dependent on the noise reali¬ 
sation n; the expectation of this random variable is given 
by [H] 


Ini? 


exact 



(65) 


Both Eqs. (651 and Eq. (64) are independent of the sig¬ 


nal amplitude j 4, and hence are unchanged by taking the 
limit of large SNR, —>■ oo. Therefore (as one might have 
expected) the exact likelihood evaluated at Atr remains 
finite in this limit and the true parameters are never com¬ 
pletely excluded from the posterior at any value of the 
SNR. 

The approximate likelihood Eq. ([^ is given by 


L(A) oc exp 



s - AH{X) 



( 66 ) 


The approximate likelihood is peaked at Abf = Atr + AA, 
where (see Eq. (53l) 


AA“ = - (r 

A ^ 


-1 


ab 


dbH{X 


bf, 


- (r-i)“' (5h{K)\dbH{XM)) ■ (67) 

The FIM Tat, is also here defined to be independent of 
A. The approximate likelihood evaluated on the true 
parameters is given by 


L(Atr) oc exp 


1 
2 

oc exp [ 


n -l- ^ ^^(Atr) — 77(Atr)^ 


n — A5h{Xf[) 


( 68 ) 


The approximate likelihood evaluated on the best fit pa¬ 
rameters is given by 


L(Abf) oc exp 
oc exp 


n- A {h{Xtv) - i7(Abf)) 
n + A (5h{K) - lxydaH{Xu)) 


(69) 


r 


where, as before, the waveform difference has been ex¬ 
panded to leading order in AA. The ratio of the two like¬ 
lihood i?approx = E(Atr)/T(Abf) Can be evaluated from 
Eq. (68) and Eq. ( [6^ , and taking the limit of large SNR 
gives 


lim In R. 
A —¥00 


approx 




-1 


, ab 


ShiXtr) aai7(Abf); 

X {6h{Xtr) dbH(XM)) . (70) 


Unlike i?exact, this ratio does not depend on n. This is 
because in the limit of large SNR, only the terms from 
the exponents of Eq. (681 and Eq. (691 proportional to A"^ 


contribute, and the noise-dependent terms are all propor¬ 
tional to A. Also unlike i?exact, this ratio does depend on 
the amplitude and i?approx —t 0 in the limit of large SNR. 
Therefore, as anticipated above, the approximate likeli¬ 
hood excludes the true source parameters with complete 
certainty in the limit of large SNR (unless {Sh{Xtr)\-) = 0, 
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in which case the exact likelihood is recovered). 
The marginalised likelihood Eq. ([^ is given by 


C{X) (X exp I -- 


s — AH{X) + A^{X) 


1 + ^V2(A) 


(71) 


The values of the marginalised likelihood evaluated on 
the true and best-fit parameters are given by Eq. (73) 


and Eq. (74), and the ratio of these two likelihoods is 
denoted R, 


inarg? 


The marginalised likelihood is peaked at Abf = Atr + AA, 
where, by comparison with Eq. (591, 

AA“ = ^(r-i)“'’(n|4(i?(Abf)-M(Abf))) 

(^dh{Xt,)\dbH{XM)) 

+ (r-i)“''(Ai(Abf)|5f,i7(Abf)) . (72) 


£(Atr) OC exp I -- 


n — A(5h(Atr) + Al/i(Atr) 


£(Abf) c>c exp 


1 -I- A'^a'^ {Xtr) 
n — ASh(Xtr) — Al-AA“9ai7(Abf) -I- A/i(Abf) 


1 -I- A'^a'^{Xb{) 


lim Iniin 

A—VOO 


1 


{Sh{Xtr) - M(^f) daHiXu)) (ShiXtr) - M(Abf) 5b£f(Abf) 


2cr2(Abf) 

2 , 

M(Abf) T /r(Atr) T 2 ^ J/i(Atr) l^(Atr) /^(Abf) 

1 


2cr2(Abf) 


(r-i)“'’ (<5h(Atr) - M(Abf)|aai7(Abf)) ((5h(Atr) - Ai(Abf)|4i7(Abf)) . 


(73) 

(74) 


(75) 

(76) 


The approximation made in going from Eq. (75) to 
Eq. (76) involves dropping terms which are products 
of small quantities. Because the EIM is a symmetric, 
positive-definite matrix, the numerator in Eq. (76) is a 
negative number, and hence i?marg < 1 as required to 
ensure Abf is the peak of the likelihood. 


As was the case with i?approx 5 this expression for i?marg 
does not depend on the noise. However, unlike i?approx 
the expression for i?marg also does not depend on the 
amplitude A. Therefore, in the limit that the SNR be¬ 
comes large i?marg tends to a constant value which de¬ 
pends quadratically on {Sh{Xtr) — /f(Abf)|-). As the SNR 
increases, the true parameters are not excluded from the 
marginalised likelihood, instead the likelihood distribu¬ 
tion tends to a constant distribution (i.e. no dependence 
on n), and the ratio by which the true parameters are 
disfavoured compared to the best fit parameters is set 
by the ability of the GPR to recover the true waveform 
difference. 


Intuitively, the reason the marginalised likelihood is 


able to achieve this useful behaviour, even if the true 
waveform difference is not perfectly recovered by the 
GPR interpolation (i.e. (Sh(Xtr) — fJ-(Xh{)i} 7 ^ 0), is due to 
the way the hyperparameters in the covariance function 
are chosen. The hyperparameters were fixed to their op¬ 
timum values by maximising the hyperlikelihood for the 
training set (as described in Sec. E- During this process 
the overall scale hyperparameter crj gains a dependence 
on the amplitude proportional to A^. Hence the GPR 
uncertainty cr^(A) is also proportional to A^. As can be 
seen from Eq. (0, in the limit of large SNR the am¬ 
plitude dependence cancels in the exponential and the 
marginalised likelihood tends to a constant distribution. 
Therefore, the marginalised likelihood never excludes the 
true source parameters from the final posterior with com¬ 
plete certainty. 
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C. Limits of the marginalised likelihood across 
parameter space 

In this section we examine the behaviour of the 
marginalised likelihood in the limit of being far from any 
training points and being at a training point. 

First we examine the behaviour of the marginalised 
likelihood in the former case, at a large distance (r^ ^ 1) 
from any of the points in the training set. From Eq. (25) 


it can be seen that well outside of the training set /r(A) —)■ 
0 and cr^(A) —>■ aj. Therefore, from Eq. (28), the log 
marginalised likelihood tends to 


ln/:(A) ^ 


InL(A) 


l + ShiX,) 


(77) 


Well outside of the training set the marginalised likeli¬ 
hood ln>C(A) recovers the standard, approximate likeli¬ 
hood L{X) up to a constant factor. This constant factor 
is one plus a linear combination of the overlap integrals 
of all the waveform differences in the training set. Since 
the denominator in Eq. (77) is always greater than unity 


(this is ensured by the positive-dehnite property of the 
covariance matrix), it broadens any peak in the likelihood 
outside of the training set. The amount of the broaden¬ 
ing is set by the magnitude of the waveform differences in 
the training set via the overlap matrix {dh{Xi)\Sh{Xj)). 
This is the behaviour that would be expected; in the 
absence of any accurate waveforms the parameter uncer¬ 
tainties obtained from the approximate waveforms should 
be multiplied by a constant factor depending upon our 
level of belief in the accuracy of the approximate wave¬ 
form model. In turn, our level of belief in the accuracy 
of the approximate waveform is learnt from the training 
set in the process of training the GP. 

We now consider the behaviour of the marginalised 
likelihood evaluated at one of the training set points A^. 
First, consider the case where (T„ = 0. In this case, the 


interpolated waveform difference, from Eq. (24), at Xe 


recovers the true waveform difference, and the GPR un¬ 


certainty, from Eq. (25), vanishes at Xf, 


/r(A^) = 5h{Xe ), 

CT^Xi) = 0 . 


(78) 

(79) 


Therefore the marginalised likelihood in Eq. (28) recovers 


(80) 


the exact likelihood with no additional broadening. 

C{Xi) = L\Xe). 

This is also the behaviour that would be expected; at a 
point in parameter space where the accurate waveform is 
known, the accurate likelihood is recovered. 

If an 7 ^ 0, then Eq. ([7^ and Eq. ([7^ become 


m(A,) = Sh{Xe)-alY.HX,,Xi)6hiX,) + Oiaf^)^ 


a^(X,) = a2^fc(A„A,)fc(A„A,) + 0(a4). (81) 


In this case any peak in the marginalised likelihood will 
be slightly shifted and broadened relative to the peak in 
the accurate likelihood by an amount consistent with the 
uncertainty cr„ in the accurate waveform model. 


V. IMPLEMENTATION 


In this section we present an illustrative implementa¬ 
tion of our GPR approach. As a simple example, we 
consider estimating a single parameter; a full multidi¬ 
mensional application that would be appropriate for ac¬ 
tual GW data analysis will be investigated in future work. 
We begin in Sec. |V A| by introducing the waveforms we 
use for this study. In Sec. |VB| we describe the place¬ 
ment of the training set points for the GPR; in order to 
investigate the effect of training set on the GPR inter- 
polant two sets were constructed with different numbers 
of points and grid spacings. In Sec. |V C| we present re¬ 
sults for maximising the hyperlikelihood to find the opti¬ 
mum hyperparameters, dop, for the interpolation; this is 
done for a range of different covariance functions on each 
of the training sets described in Sec. |VB| In Sec. |VD| 
we interpolate the waveforms across parameter space for 
the different training sets and different covariance func¬ 
tions described and compare the interpolated wavef orms 
H{X) — fi{X) to the accurate waveforms h{X). In Sec. V 
we present results for the GPR uncertainty, cr^(A), for the 
different training sets and different covariance functions 
considered. Finally in Sec. |VF| we present results for 
the marginalised likelihood >C(A), and compare with the 
results obtained using the approximate likelihood L(A), 
and the exact likelihood L'{X). 


A. Model waveforms 

In order to implement the GPR, a choice has to be 
made regarding which waveform models to use. The 
method uses two waveform approximants; the accurate 
h{X) and the approximate H{X) waveforms. The accu¬ 
rate waveform should be the most accurate available at a 
computational cost that permits the offline construction 
of the training set V. The criteria for choosing the ap¬ 
proximate waveform is less clear, a balance needs to be 
struck between accuracy and speed. If the model is com¬ 
putationally cheap but not accurate enough the wave¬ 
form difference, 5h{X) = H{X) — h{X), will be large and 
vary on short length scales over parameter space; these 
are the situations which will cause the GPR to perform 
worst. On the other hand an accurate model which is 
too computationally expensive could slow down any PE 
to such an extent that there ceases to be any benefit in 
using the marginalised likelihood instead of the accurate 
likelihood. 

We used two waveform models implemented in 
the LIGO Scientific Gollaboration Algorithm Library 
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(LAL)7 As our intention here is to provide a proof of 
principle, we choose the IMRPhenomC approximant [61] 
as the accurate waveform and the widely used TaylorF2 
approximant [33J [SH [53] as the approximate waveform; 
both of these models are sufficiently fast to evaluate that 
we can compute and then compare the three likelihoods 
(accurate L'{X), approximate A(A), and marginalised 
£(A)) and directly assess the performance of the GPR. 

Both of the approximants we have chosen to use here 
are frequency-domain models, i.e. they naturally return 
the waveform in the Fourier domain h(/).® The IMRPhe¬ 
nomC waveform includes inspiral, merger and ringdown, 
while the TaylorF2 waveform only includes the inspiral. 

We investigate the merger of non-spinning circular 
binaries. This limits the number of intrinsic parame¬ 
ters describing the system to two, the masses of the 
two component objects, A = {mi, m 2 }. To further 
simplify the problem we place training set points only 
along a one-dimensional subspace, which we choose to 
be a surface of constant mass ratio, Q = m 2 /mi (with 
mi > m 2 ), parametrised by the value of the chirp mass 
Me = (mim 2 )^/^/(mi -I- m 2 )^^^. This keeps the size 
of the training set small, and hence the computational 
complexity of the GPR to a minimum. This allows us to 
instead focus our attention on the novel features of the 
marginalised likelihood, and explore the effect of chang¬ 
ing various features of the method. 


B. The training set 

For simplicity we restrict the range of the coordinates 
which we search over to reduce the computational com¬ 
plexity. This is again to allow us to focus our attention on 
the novel features of the method. The training sets cover 
the chirp mass in the range Me S [5, 5.6] Mq and the 
mass ratio is fixed to the (nearly equal mass) Q = 0.75. 
The placement of training set points was done as a regu¬ 
lar grid in chirp mass with a step size between points of 
AMe. 

The chirp-mass range has been chosen to demonstrate 
the properties of the method. For lower masses, the sig¬ 
nal is dominated by the inspiral where both approximants 
agree well. Therefore, interpolating these small differ¬ 
ences would not be a robust test. At higher masses, 
where the signal is just merger and ringdown, the two 
approximants are completely different; we get no use- 


^ http://www.lsc-group.phys.uwm.edu/lal 

® in previous work m the marginalised likelihood has been im¬ 
plemented with time-domain approximants. The method works 
equally with frequency-domain or time-domain models without 
the need to transform between them. In the offline stage the 
waveforms enter only via the overlap matrix {Sh{Xi)\5h{Xj)), and 
in the online stage the waveforms enter only in the linear combi¬ 
nation for /r(A) in Eq. \2A\ , which commutes with the operation 
of taking the Fourier transform. 


TABLE I: The properties describing the positions of the tem¬ 
plate waveforms for each of the three training sets used. 



AMe 

N 

Vo 

1.0 X 1O"^M0 

60 

Vi 

5.0 X IQ-^Mq 

120 


ful information from the TaylorF2 waveform and may as 
well interpolate IMRPhenomG directly. We do not an¬ 
ticipate that in practice we would consider waveform dif¬ 
ferences as significant as the complete absence of merger 
and ringdown; hence, this example, although only one¬ 
dimensional, should be a rigorous test of what waveform 
uncertainties can be successfully interpolated. Under¬ 
standing if this continues to be true for the interpolation 
of a more intricate waveform difference across a higher¬ 
dimensional parameter space, must wait for further stud¬ 
ies to be completed. 

To allow us to explore the effect that the density of 
points in the training set has on the marginalised likeli¬ 
hood two different values for AMe were considered. This 
leads to two different training sets whose total number n 
of points are different; the properties of these two train¬ 
ing sets are summarised in Tab. |l| It is expected that 
the GPR interpolation, and hence the marginalised like¬ 
lihood, will perform better when using the denser set . 

Once the training set points {A} were specified, both 
the approximate H{Xi) and accurate h{Xi) waveforms 
discussed in Sec. jV Aj were evaluated at each point, and 
the waveform differences {:5/i(Ai)} stored for use during 
the GPR interpolation. The matrix of waveform differ¬ 
ence overlaps = {Sh{Xi)\6h{Xj)) was also evaluated 
and stored for use during the hyperlikelihood maximisa¬ 
tion procedure. 


C. The hyperparameters 


Initially the training sets described in Sec. jVBj were in¬ 
terpolated using the SE covariance function in Eq. (321. 
This covariance function has just two hyperparameters, 
9 = {o’f,gMaMc}- The one-dimensional metric gM^Mc 
can be exchanged for a length scale in the chirp mass 
parameter <5Ale = y/dMcMa ■ A fixed noise term with 
cr^ = 10“^ was used for all the covariance functions in 
this section, to make the inverse of the covariance func¬ 
tion numerically stable as discussed in Sec. Ill G The 
hyperlikelihood for the training set Vq was maximised 
with respect to these two hyperparameters. The opti¬ 
mum values for the hyperparameters were found to be 


= 3.49 X 10“^, (82) 

6Me = l.Il X IQ-^Mq . (83) 


The hyperlikelihood is shown in Fig. The hyperlike¬ 
lihood was also maximised for the training set us- 
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FIG. 3: The hyperlikelihood, from Eq. ( [1^ , for the SE covari¬ 
ance function, maximised over the scale hyperparameter cr/, 
plotted against the chirp mass length scale SMc The hyper¬ 
likelihood is shown for both of the training sets (normalised 
to a peak value of 1). The denser training set Di was found 
to favour smaller length scales. 


ing the same SE covariance function and those results 
are also shown in Fig. For the denser training set 
Vi the optimum length scale was found to be smaller, 
SMc = 6.31 X 1O“^M0. For both training sets, in the 
limit that the length scale becomes much larger than the 
total width of the training set (O.GMq) or much smaller 
than the grid point spacing {AMc), the hyperlikelihood 
tends to a constant value. This behaviour can be un¬ 
derstood by examining the expression for the hyperlike¬ 
lihood in Eq. (181. 

In order to explore the effect that the choice of co- 
variance function has on the marginalised likelihood, the 
training sets were also interpolated using the Matern co- 
variance function in Eq. (35). This covariance function 
has an additional hyperparameter, 9 = {af,gM. 

The hyperlikelihood for training set Vq was maximised 
for this covariance function. It was found that the hyper- 
likelihood surface did not possess a peak, instead a ridge 
was found tending to a maximum at a value ry —)■ oo, and 
values of (7/ and were found to be the same as 

for the SE covariance function. In Fig. we plot the log- 
hyper likelihood (maximised over cr/) against chirp-mass 
length scale and the additional hyperparameter rj. 

As the Matern covariance function recovers the SE 
function in the limit ry —)■ oo, there will be no difference in 
the performance of the interpolants for this training set 
when using the Matern or SE covariance functions. If the 
volume under the hyperlikelihood surface (the hyperev¬ 
idence) is used as a figure-of-merit for which covariance 
function the data favours, then in this case the data is 
equally well described by either covariance function, but 
the SE covariance function is favoured over the Matern 
due to the smaller prior volume (the Occam penalty). 

The hyperlikelihood was also calculated for both train¬ 
ing sets Vq and Vi using the PLE covariance, see 
Eq. (33), and the Cauchy covariance, see Eq. (34), con¬ 
sidered in Sec. I 


In both cases a similar behaviour was 
observed. For the PLE covariance, a peak in the hyper- 
likelihood was found at ry = 2, where the PLE covariance 
equals the SE covariance. For the Cauchy covariance, a 



FIG. 4: The hyperlikelihood, from Eq. (18 1 , surface for the 
training set (Do using the Matern covariance, maximised over 
the hyperparameter u /, plotted against the chirp mass length 
scale SA4c and the hyperparameter, rj. The hyperlikelihood 
does not show a clear peak, instead a ridge in the hyperpa¬ 
rameter space favours the limiting case »y —>■ oo, in which limit 
the Matern covariance function is equal to the SE covariance 
function. On the near-side faces of the plot box we show the 
hyperlikelihood sliced parallel to the coordinate axes though 
the point {SMc ~ lO~^'®M 0 ,?y = 10). The solid black line 
on the near, left-hand face of the box very closely matches 
the solid black curve in Fig. (up to an arbitrary additive 
constant). 


ridge in the hyperlikelihood was found tending to a max¬ 
imum for ?y —> oo (similar to the Matern case shown in 
Fig. 1^, in which limit the Cauchy covariance also recov¬ 
ers the SE covariance. As with the Matern covariance, 
if the hyperlikelihood is used as a figure-of-merit for se¬ 
lecting the covariance function then the SE covariance is 
favoured over both the PLE and Cauchy functions due 
to the Occam penalty. 

It is clear that interpolations of the training sets 
Vq and Vi using any of the PLE, Cauchy, or Matern 
covariance functions, evaluated at the hyperlikelihood- 
maximising hyperparameters, would yield identical re¬ 
sults to an interpolation using the simpler SE covariance. 
For this reason, in the following sections we do not use the 
PLE, Cauchy, or Matern functions further and instead fo¬ 
cus on the SE covariance function. We will, however, also 
consider using the Wendland polynomial function in the 
following sections as it reduces the computational cost. 

The hyperlikelihood for the compact support Wend¬ 
land polynomial covariance functions are shown in Fig.[^ 
for the cases </ = 0, I, 2, 3. The compact-support func¬ 
tions can develop multiple-peaks in the hyperlikelihood 
surface associated with the length-scale of the training 
set: multiples of the training-set grid spacing are indi¬ 
cated with vertical blue lines in Fig. These subsidiary 
peaks occur in the SMc hyperparameter because as the 
size of the compact-support region grows, the (integer) 
number of training set points it contains changes discon- 
tinuously. 

From Fig.[^it can be seen that for the training set Vq, 













18 



FIG. 5: The hyperlikelihood, from Eq. (181, for the training 
set 'Do using the Wendland polynomial covariance functions, 
maximised over the scale hyperparameter a/, plotted against 
the chirp-mass length scale 5Mc- The vertical blue lines in¬ 
dicate multiples of the training-set grid spacing AA4c- 


a value of g = 1 is favoured with a length-scale SMc = 
4.37 X 1 O“^M0. In the following sections we will use 
Wendland covariance function with all values of q (and 
their associated peak hyperlikelihood length scales) to 
interpolate ’Dg. 

The optimum hyperparameters depend on the detec¬ 
tor noise power spectral density via the overlap matrix 
{dh{Xi)\Sh{Xj)). In App.[^ an investigation of the sensi¬ 
tivity of the optimum hyperparameters to small changes 
in the detector noise properties is described. It was found 
that for any realistic changes to the noise curve, the op¬ 
timum hyperparameters were changed by an amount too 
small to have any noticeable effect on the interpolant. 


D. The interpolated waveforms 

The GPR waveform H{X) — fi{X) could be viewed as 
a new waveform approximant formed from the approxi- 
mant waveforms and the use of GPR on the training set 
of accurate waveforms. It is then natural to ask how this 
new approximant compares to the original ones. This 
can be assessed by calculating the overlap between the 
different waveforms, where the overlap is defined by 

overMa,i) = d|!|_, (84) 

using the inner product defined in Eq. ([^. 

Only considering the overlap misses the important ex¬ 
tra benefit which the marginalised likelihood approach 
brings. Our method is not just supplying a new wave¬ 
form approximant, but also providing a way of modifying 
the posterior to account for the uncertainties known to 
be in the approximant. This extra information which 
modifies the likelihood surface is included through (t(A). 
Nonetheless, it is still informative to temporarily treat 
H{X) — fJ-{X) as if it were a new waveform approximant 
and see how it compares with the approximants h{X) and 


Overlap with accurate waveform 
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FIG. 6: A plot of the overlap between the interpolated wave¬ 
form H{X) — p(A) and the accurate waveform h{X) as a func¬ 
tion of the chirp mass A4c- The bottom panel is the same 
plot with a different ordinate axis scale. The two black lines 
show the overlap using both training sets. Do and Di, inter¬ 
polated using the SE covariance function. The red line shows 
the overlap between the approximate waveform H{X) and the 
accurate waveform h{X) for comparison. The vertical blue 
lines show the position of the training set points for Do- In 
the bottom panel, it can be seen that, for either interpolant, 
the overlap becomes one when evaluated at the training set 
points. 


Overlap with accurate waveform 



FIG. 7: A plot of the overlap (or overlap) between the interpo¬ 
lated waveform H{X) — /r(A) and the accurate waveform h{X) 
as a function of the chirp mass Me- The different curves cor¬ 
respond to using the Wendland polynomial covariance func¬ 
tions with different values of q to interpolate the training set 
Do- The vertical blue lines show the position of the training 
set points for Do- 
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H{X) from which it was built. Fig.j^shows the waveform 
overlap between the interpolated waveform H (A) — ^(A) 
and the accurate waveform h{X) as a function of chirp 
mass near the edge of the training set. Also shown in 
the dotted curve is the overlap between the approxi¬ 
mate waveform H{X) and the accurate waveform h{X). 
The interpolated waveforms have a much higher over¬ 
lap than the approximate waveforms, as would be ex¬ 
pected. Within the training set the overlap is increased 
from ^ 0.35 to no less than ~ 0.985 even for the sparser 
training set Dq. For the denser training set overlaps 
no worse than ~ 0.999 were found inside the range of 
the training set. Outside the training set the interpo¬ 
lated waveform tends rapidly to the approximate wave¬ 
form H{X). 

The training set waveforms were also interpolated us¬ 
ing the Wendland compact support covariance functions 
discussed in Sec. |IIIDl The cases g = 0, 1, 2, 3 were con¬ 
sidered separately. The waveform overlap using these in- 
terpolants is plotted in Fig. The performance of these 
interpolants should be compared with the results using 
the SE covariance function in Fig. 

The least smooth of the Wendland polynomials, the 
q = 0 case, performs noticeably worse than the SE co- 
variance; inside the training set the overlap drops as low 
as ~ 0.955 compared to ~ 0.985 for the SE. However, 
even a overlap of ~ 0.955 is still a great improvement 
over the overlap of ~ 0.35 for the approximate wave¬ 
form alone. For the g = 0 Wendland polynomial the in- 
terpolant has a discontinuous first derivative, which can 
be s een in Fig. (this is expected and was discussed in 
Sec. Ill and in detail in App. 0. The higher values of 
g have discontinuities in the higher ordered derivatives, 
but these curves look smooth to the eye. The smoother 
Wendland polynomials, with g > 0, all perform very sim¬ 
ilarly to the SE covariance function; inside the training 
set the overlap drops as low as ^ 0.985 for the g = 2 
interpolant. 


E. The GPR uncertainty 


The GPR performs an interpolation of the points in the 
training set and naturally returns a Gaussian error cr(A), 
see Eq. (25), for each interpolated point. In our present 


one-dimensional interpolation this is simply a function of 
Aic- A small section of this curve taken from the edge 
of the training set is shown in Fig. Inside the train¬ 
ing set, the error surface has a regular, periodic pattern 
with minima at the training set points and maxima in be¬ 
tween. This regularity is because the GP used for the in¬ 
terpolation is stationary, the training-set points used are 
regularly spaced, and each point has an identical error (a 
jitter J = 10“^). If these conditions were to be relaxed, 
then the error surface would become more complicated. 
In general, a larger a{X) indicates greater theoretical un¬ 
certainty and highlights regions where we would benefit 





FIG. 8: A plot of the GPR uncertainty cr^(A) as a function of 
the chirp mass parameter for both of the training sets, using 
the SE covariance function. The vertical blue lines show the 
position of the training set points for Vq. Outside of the 
training set the uncertainty tends to a constant . Inside the 
training sets the error is approximately periodic with minima 
at the training set points. The maximum uncertainty inside 
the training set is smaller for the denser training sets. 


GPR uncertainty 



FIG. 9: A plot of the GPR uncertainty cr^(A) as a function of 
the chirp mass parameter for the training set Do, using the 
Wendland polynomial covariance functions. The vertical blue 
lines show the position of the training set points. 


from additional accurate waveforms (e.g., where it would 
be beneficial to perform more NR simulations). 

Near the edge of the training set the behaviour be¬ 
comes less regular and well outside of the training set the 
error tends to a constant value, cr^(A) —> ct/ as A —> co. 
This behaviour is seen in Fig. [^for all three training sets. 
The training sets with smaller grid spacings have smaller 
uncertainties everywhere in parameter space. 

The GPR uncertainty was also calculated using the 
Wendland polynomial covariance functions to interpolate 
the training set Vq', these are shown in Fig. The GPR 
uncertainty, expressed as a fraction of aj, is largest for 
the smallest values of g; this can be traced back to the 
optimum length scale for the Wendland polynomials in¬ 
creasing with g (see Fig.[^. This means that the uncer¬ 
tainty grows more slowly as the interpolating point moves 
away from the training set points, and hence reaches a 
smaller maximum value between training set points. The 
smoother (g > 0) Wendland polynomials perform simi¬ 
larly to the SE covariance function, in the sense that both 
the GPR interpolants (which we quantify via the overlap) 
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and the GPR uncertainties are almost identical. Hence, 
in the following sections we will only consider using the 
SE covariance function; the high q Wendland polynomi¬ 
als would yield identical results. 


F. The likelihood 


Finally we put together the interpolated waveform 
H{X) — fi{X) and the GPR uncertainty cr^(A) to give 
the marginalised likelihood in Eq. (28 1 . We compare the 
performance of the marginalised likelihood >C(A) to the 
approximate likelihood L(X) and the accurate likelihood 
L'{X). For the injected signal we use the accurate wave¬ 
form h{X). We also consider the case where the noise re¬ 
alisation is zero (the most likely realisation), this makes 
comparisons easier. 


We injected a signal at a chirp mass of Me = 5 .O 45 M 0 ; 
this is inside the training set TXq and midway between 
training set points. Injecting the signal midway between 
the points is conservative as this is the point at which 
we would expect the marginalised likelihood to perform 
worst. The three different likelihoods were evaluated as 
a function of chirp mass (all other parameters set to the 
injected values). This was done at a range of SNRs and 
the results are shown in Fig. 10 The top row of panels in 
Fig. 10 show the likelihoods renormalised to a peak value 
of one, this makes the relative positions of the peaks clear 
and easy to compare. The bottom row of panels shows 
the log-likelihood without any renormalisation, this il¬ 
lustrates how the approximate likelihood is suppressed 
relative to the true likelihood (the detection problem dis¬ 
cussed in Sec. |l|. 

The exact likelihood L'{X) is always peaked at the in¬ 
jected value of the chirp mass (because the injected noise 
realisation is zero) and the width of the peak decreases 


with increasing SNR. The approximate likelihood L{X) 
is peaked way from the true value, indicating a system¬ 
atic error of AgygAlc = 5.2 x Mq. The width of 
the approximate likelihood peak also decreases with in¬ 
creasing SNR and for SNR > 12 (which is also roughly 
the detection threshold [3113) the true parameters are 
excluded at increasing significance. The bottom row of 
panels in Fig. [TO] shows that the approximate likelihood 
is suppressed by a significant amount, for a typical SNR 
of 16 it is supressed by 80 in log relative to the ex¬ 
act likelihood; this reduces the Bayesian evidence for a 
detection. The factor by which the approximate like¬ 
lihood is suppressed increases exponentially with SNR. 
Finally, the marginalised likelihood is peaked much closer 
to the exact likelihood: the systematic error is reduced 
to AgysAlc = 9.0 X 1 O“^M0. However, as discussed in 
Sec. |IVB[ the peak in the marginalised likelihood does 
not continually narrow as the SNR increases; for SNR 
> 30 the width becomes constant. Gonsequently, the 
true parameters are never excluded at high significance; 
in the limit of infinite SNR the true parameters lie at the 


~ Icr level. The bottom panel of Fig. shows that the 
marginalised likelihood is not suppressed relative to the 
exact likelihood in the vicinity of the peak. 

Gomparing the properly normalised likelihoods, we see 
that the marginalised and exact likelihoods roughly agree 
at low SNR. As the SNR is increased, the marginalised 
likelihood deviates from the exact likelihood and devel¬ 
ops oscillatory behaviour with period equal to the train¬ 
ing set point spacing. In the limit of low SNR, all of the 
parameter estimation uncertainty comes from the noise, 
but as the SNR increases, the relative size of this statis¬ 
tical uncertainty becomes smaller and at high SNR we 
are dominated by model uncertainty. The marginalised 
likelihood correctly encapsulates this behaviour, as can 
be seen in the sequence from left to right in Fig. 


VI. SUMMARY 

In [13] , some of the authors suggested GPR as a means 
of incorporating theoretical uncertainty into GW data 
analysis. We have now thoroughly investigated the prop¬ 
erties of this method, elucidating considerations for a 
practical implementation. A detailed derivation of the 
marginalised likelihood, and the use of GPR to interpo¬ 
late model error was presented in Sec. [T^ GPR is non- 
parametric, in the sense that only the functional form 
of the covariance function is specified by hand, with its 
hyperparameters then learnt from the training set, mak¬ 
ing it well suited to modelling theoretical uncertainty. 
The expression for the marginalised likelihood derived in 
Sec. |lT| made some assumptions about the frequency co- 
variance of the waveforms in the training set; in particu¬ 
lar it was assumed that at a particular point in parameter 
space, the model error is highly correlated in frequency. 
These assumptions may prove to be too restrictive in the 
future, and could be relaxed by simultaneously perform¬ 
ing GPR interpolation in frequency (as well as in parame¬ 
ter space) on the training set waveforms; this would have 
the added advantage of allowing the inclusion of wave¬ 
forms with different frequency samplings, which may be 
beneficial when using multiple waveform approximants 
and NR waveforms from multiple sources. 

The choice of covariance function is central to GPR 
as it encodes our prior beliefs about the function space 
that we are interpolating. We discussed various choices 
of covariance function in Sec. IHIl We have found that 
the simple SE covariance function (as used in 1131) per¬ 
forms as well as more complicated alternatives, at least 
for the relatively small one dimensional training sets con¬ 
sidered here. The compact-support Wendland covariance 
functions with large q were found to perform comparably 
to the SE, but offer the additional advantage of reduced 
computational cost. This makes them appealing for fu¬ 
ture work involving larger training sets. 

We proved a number of properties for our marginalised 
likelihood in Sec. EYI in particular its limiting behaviour 
for large signal amplitude (where the theoretical errors 
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FIG. 10: A plot of the different likelihoods for a variety of SNRs. Vertical lines indicate the position of training set points. The 
top row of panels show the likelihood normalised to the same peak value; this makes the peak positions clear and shows how 
the marginalised likelihood tackles the parameter estimation problem associated with the inaccurate models. The bottom row 
of panels shows the log-likelihood; this makes the suppression of the peak value of the approximate likelihood clear and shows 
how the marginalised likelihood could be used to tackle the detection problem. 


are known to be most significant [12]) and its limiting 
behaviour both far from and near a point in the training 
set. In the discussion of the latter, the linearised results 
previously obtained in [T^ were recovered. All of these 
properties demonstrate the suitability of GPR for mak¬ 
ing robust inferences. The marginalised likelihood suc¬ 
cessfully describes our belief in our inferences, including 
our uncertainty in waveform templates. 

In Sec. jVj we presented a one-dimensional implemen¬ 
tation of our marginalised likelihood and demonstrated 
that it offers an improvement in PE accuracy. For this, 
we chose two inexpensive waveforms to aid computation; 
in real data analysis situations, we expect more accu¬ 
rate waveforms (including those calculated using NR) to 
be used in the training. However, this choice of wave¬ 
forms does illustrate the efficacy of the new marginalised 
likelihood. In particular we find that even when using 
qualitatively different waveforms (inspiral-only TaylorF2 
compared to inspiral-merger-ringdown IMRPhenomC), 
waveform matches as high as ~ 98.5% can be obtained in 
the mass range we considered. Restricting ourselves to 
the simple case of one-dimensional PE, we explored vari¬ 
ous possibilities for GPR. In particular, the effect of dif¬ 
ferent training set sizes was examined; as expected, the 
performance of the marginalised likelihood is improved 
by using denser training sets. Additionally, the impact 
of varying the SNR of the injected waveform was studied. 
In the standard likelihood model, errors become more se¬ 
vere as the SNR is increased, but we confirmed that even 
in the limit of large SNR, the marginalised likelihood 
remained consistent with the injected parameters. We 
expect these results to carry over to a full multidimen¬ 
sional analysis, which is the next step in developing this 
technique. 

A possibly complementary use for the GPR approxi- 
mant is as a less expensive alternative to the accurate 
waveform for more expedient PE; however, there exist 


other means of constructing computationally inexpensive 
waveforms, such as reduced-order modelling |5U|BS|. The 
advantage of GPR is that it not only supplies an inter- 
polant, but also gives an uncertainty, which can be used 
to gauge accuracy away from training points. A second, 
related feature is that GPR naturally allows for uncer¬ 
tainty to be included in the accurate models that the 
interpolant is calibrated against. It is the ability of GPR 
to include theoretical uncertainty that makes it attrac¬ 
tive for GW astronomy, that this can be done without 
significant on-line cost is a welcome bonus. 

In conclusion, marginalising over waveform uncer¬ 
tainty is a robust and effective method of accounting 
for theoretical error in both PE and detection problems. 
GPR is a natural and effective means of performing this 
marginalisation. The marginalised likelihood is naturally 
inferior to a likelihood calculated with more accurate (but 
inevitably more computationally expensive) waveforms, 
but it offers significantly improved performance over the 
standard likelihood calculated with cheap waveforms. In 
addition, the marginalised likelihood is almost as quick 
to evaluate online as the standard likelihood, although 
there is additional offline computation required to con¬ 
struct the training set and train the Gaussian process. 
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Appendix A: Continuity and differentiability of GPs 


In this appendix we give proofs of the results stated in 
Sec. |III| concerning the continuity and differentiability of 
GPs, following the approach of [33] ■ Let Ai, A 2 , A 3 ... be 
a sequence of points in parameter space which converges 
to a point A^. , in the sense lim^_>.oo |A^ — A*| = 0, where, 
as in Section III |a?| denotes the norm with respect to the 
metric on parameter space, as discussed in Section [ill A| 
The GP Y (A) is said to be MS continuous at A* if 


lim E 

^—>•00 


(y(A^)-Y(A*)|r(A^)-y(A*))] =o, (ai) 


where E[...] denotes the expectation of the enclosed 
quantity over realisations of the GP. For notational con¬ 
venience, we denote this MS limit as 

y(A*) =l.i.m.y(A,), (A2) 

£—>■00 


where l.i.m. stands for limit in mean |66j . MS continuity 
implies continuity in the mean. 


lim E 

£.—¥00 


y(Af) - r(A*) 


= 0 . 


(A3) 


This follows from considering the variance of the quantity 
Y (A^) —Y(A*), and the fact that variance is non-negative. 
There are other notions of continuity of GPs used in the 
literature, but the notion of MS continuity relates most 
easily to the covariance. The mean and the covariance of 
a GP are defined as 


m(A) = E y(A) 


^(Ai,A 2 ) = E (y(Ai) - m(Ai) y(A 2 ) - m(A 2 


(A4) 


Using these, Eq. (AI) can be written as 

lim < fc(A*, A*) — 2fc(A^, A*) -I- k{Xi, Xg) 

1^00 L 

-I- ^to(A*) — m(Af) to(A») — TO(A£)j I = 0,(A5) 
and using the continuity of the mean in Eq. (A3) gives 


lim 

^—>•00 L 


^(A^, A:^c) — 2k{Xi^ A^) + A;(A^, A^) — 0. (-A-6) 


This condition is satisfied if the covariance function is 
continuous at the point Ai = A 2 = A*. Therefore, we 
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arrive at the result that if the covariance function is con¬ 
tinuous in the usual sense at some point A*, then the 
corresponding GP is MS continuous at this point. In 
fact, a GP is continuous in MS if and only if the co- 
variance function is continuous |39j . although this is not 
proved here. In the special case of stationary covariance 
this reduces to checking continuity of k{f) at r = 0, and 
in the special case of isotropic covariance, continuity of 
fc(r) at r = 0. 

We now move on from continuity to consider differen¬ 
tiability. In the spirit of Eq. ( AI[ ), the notion of taking 
the MS derivative of a GP is defined as 


ay(A) 

5A“ 

where Xa{X, e) 


l.i.m. Xa(A,e), 

£->0 

y(A -l- e Cg) — Y (A) 
e 


(A7) 

(A8) 


with parameter-space unit vector Ca- This definition can 
be easily extended to higher-order derivatives |39j . The 
MS derivative of a GP is also a GP; this follows simply 
from the fact that the sum of Gaussians is also distributed 
as a Gaussian. The covariance of Xa{X,e) is given by 


K, 


(Ai,A2 ) = E[(Aa(Ai,e)-S(Ai,e) 
-^o(A 2, e) — ■^(A2, e 


(A9) 


where Sq(A, e) = E[Aa(A, e)]. It then follows that 
fc(Ai -|- e, A 2 -|- e) — fc(Ai, A 2 -I- e) 


Ke{Xi^X2) — 


k{Xi + e, A 2 ) — fc(Ai, A 2 ) 


(AlO) 


Substituting this into Eq. (A8), the limit in MS be¬ 
comes a normal limit, and the result is obtained that 
the MS derivative of a MS continuous GP with covari¬ 
ance function A:(Ai, A 2 ) is a GP with covariance function 
d‘^k{Xi, X 2 )/ 8 X 18 X 2 ■ In general the covariance function 
of the Ud-times MS differentiated GP 


9"'‘y(A) 

is given by the 2nc;-times differentiated function 

_ 8^^^kiXi,X2) _ 

aA“^9A“^aAf (9A“= ... 8X1^-^ 8X1'^'^ ' 


(All) 


(A12) 


From the above results relating the MS continuity of 
GPs to the continuity of the covariance function at 
Ai = A 2 = A*, it follows that the n^-times MS deriva¬ 
tive of the GP is MS continuous (the GP is said to be 
Urf-times MS differentiable) if the 2nd-times derivative 
of the covariance function is continuous at Ai = A 2 = A* 
[M] . So it is the smoothness properties of the covariance 
function along the diagonal points that determines the 
differentiability of the GP. (It can also be shown that if 
a covariance function is continuous at all diagonal points 
Ai = A 2 then it’s everywhere continuous.) 


Appendix B: The effect of small changes in the noise 
PSD on the GPR interpolant 


In the offline stage of the method, the GP was trained 
using the hyperlikelihood in Eq. (18). The result of this 


process was an interpolant which enabled fast online PE. 
However, this splitting into offline and online stages also 
has a potential problem, because the training process 
makes use of the overlap matrix Mij = {Sh{Xi)\Sh{Xj)) 
which, in turn, depends upon the detector noise PSD 
Sn{f)- The noise PSD is not constant; it changes on short 
timescales as the noise drifts in the instrument (e.g., [57]), 
on longer timescales it changes more dramatically as the 
instrument is gradually upgraded [5|. There are also dif¬ 
ferences between different detectors, for example between 
the aLIGO and AdV instruments (or even between the 
two aLIGO interferometers). It would be a significant 
drawback if the offline training stage of the process had 
to be repeated for every single candidate signal because 
of small differences in the detector PSD. 

We do not expect small changes in the noise curve 
to have a significant effect on the resulting interpolant. 
First, the noise can be rescaled by an overall constant 
and have no effect on the position of the peak in the hy¬ 
perlikelihood; this can be seen from Eq. (181. Second, the 


peak in the hyperlikelihood is typically wide, and using 
the hyperparameters from anywhere in the vicinity of the 
peak still gives reasonable, if not perfect, interpolation. 
Accordingly, when the PSD changes, some of the differ¬ 
ence can be absorbed by an overall scaling, which has 
no effect on the results, and the remaining change shifts 
the peak of the hyperlikelihood away from the previously 
optimised values, but not enough to limit their applica¬ 
bility. If this is the case, then GPs trained on slightly 
different noise PSDs perform nearly identically to each 
other and there is no need to retrain for the new PSD. 

To assess the sensitivity of our results to changes 
in the noise curve, we considered three different noise 
curves chosen to represent the range of possibilities in the 
advanced-detector era. These are: an estimate of the ob¬ 
serving run 1 (01) aLIGO sensitivity (the early curve of 
[55]); the zero-detuned high-power (ZDHP) design sensi¬ 
tivity of aLIGO [Ul^, and the design sensitivity of AdV 
laiTo]. As an additional check, we also considered an in¬ 
verted top-hat noise curve. All of these noise curves are 
plotted in the left-hand panel of Fig. [m We then took 
the training set T>o and trained the SE GP to find the 
optimum hyperparameters. Shown in the centre-panel 
of Fig. E] is the hyperlikelihood surface as a function of 
chirp mass length scale for the different noise curves. As 
expected, for the range of realistic noise curves the peak 
in the hyper likelihood only shifts by a small amount. Fi¬ 
nally we used the optimum hyperparameters from each of 
these hyperlikelihood surfaces to interpolate the training 
set and calculated the overlap to the accurate waveforms 
using the ZDHP noise curve; the results of this are shown 
in the right-hand panel of Fig. El For the range of re¬ 
alistic noise curves, the overlap is equally good (cf. [51]1. 
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FIG. 11: The left-hand panel shows three different noise curves for ground-based detectors in the advanced era. Also shown in 
red is an unrealistic noise which we use for comparison. The centre-panel shows the hyper-likelihood surface for the training 
set Do using the SE covariance function and with the overlap matrix calculated using each of the noise curves in the left-hand 
panel. The right-hand panel shows the waveform overlap between the accurate and the interpolated waveform evaluated for 
parameter values between two training set points. The interpolants based on the realistic noise curves perform equally well 
(the curves lie on top of each other). The unrealistic noise curve performs worst, but still gives overlaps greater than 0.994. 


Although the inverted top-hat noise curve gives notice¬ 
ably lower overlaps, even in that case the drop in the 
overlap is still less than 0.1%, which is smaller than the 
difference between the approximate and GPR likelihoods. 


This suggests it is safe to train a GP with a fixed noise 
curve (typical for the instruments considered). The re¬ 
sulting interpolants can be used to analyse all signals 
without worrying about small drifts in the noise. 













